%matplotlib inline
・A lingerie dataset published on Kaggle.
https://www.kaggle.com/PromptCloudHQ/innerwear-data-from-victorias-secret-and-others
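・The data was extracted from e-commerce sites between June 2017 and July 2017.
E-commerce sites and brands covered by the analysis
・Victoria's Secret: official site, Amazon
・Wacoal (b.tempt'd): official site, Amazon, Macy's
・Calvin Klein: official website, Amazon, Macy's
・Hanky Panky: official website, Amazon, Macy's
・Topshop USA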
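Victoria's Secret is a subsidiary of Limited Brands, the world's fourth-largest apparel chain by sales. In 2013 it operated a total of 1,045 stores in North America, with annual revenue of 640.3 billion yen.
Its stores are decorated mainly in pink with a sexy design, and many of its products, including underwear and swimwear, follow the same sexy styling.
It invests heavily in fashion shows and catalog mail order, and its use of top models in both has drawn attention.
Recently, sales have slumped, and its overly sexy advertising is driving customers away.
It was founded in San Francisco in 1977 by Roy Raymond.
Raymond sold the company to Limited Brands in 1982, and it became part of that group.
Although it has many stores, mainly in North America, catalog mail order is its main channel.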
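Wacoal is a Japanese apparel maker headquartered in Kyoto whose core business is selling women's underwear.
In the United States it operates a brand called b.tempt'd.
Its selling points are fit and support.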
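Calvin Klein graduated from the Fashion Institute of Technology in New York in 1964 and founded his own company.
Underwear sold under his name became a hit in the 1980s, helped by billboards and print ads featuring nearly nude models in exotic poses.
Klein later focused on other products, especially perfume.
In 2002 his company was acquired by Phillips-Van Heusen Corporation.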
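Hanky Panky offers a range of colors that goes beyond the usual idea of underwear: in addition to basic colors, the vivid colors released each season are a major attraction.
Lida Orzeck, then a university professor, wore pantsuits when she lectured, at a time when working women kept up their authority by wearing trousers.
Her worry was the panty line visible from behind when she faced the blackboard.
Conscious of being looked at, she wore G-strings, underwear that shows no line, but they dug in and were very uncomfortable.
When she confided this to her friend Gale Epstein, the dexterous Epstein hand-made her a gift: underwear sewn from handkerchiefs that was comfortable to wear and showed no line.
Word spread and requests grew, so in 1977 the two founded "Hanky Panky".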
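Topshop is a British fashion retailer operating in more than 20 countries.
It sells low-priced, highly fashionable clothing, so-called fast fashion.
As Japanese companies also adopted the fast-fashion business model, its popularity in Japan declined and it withdrew from that market.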
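・What is the range of the maximum retail price (MRP) for each brand?
・What is the most expensive product?
・What is the MRP range of each brand within each product category?
・What range of colors does each brand offer?
・What is the range of colors within each product category?
・How do ratings compare across brands?
・How do ratings compare across brands within each product category?
・Which brand is discounted the most?
・Which brand is discounted the most within each product category?
・Which product categories are reviewed more than others?
・Which products have the fewest reviews?
・Which colors are common across brands, and which are unique to a brand?
・What is the general consumer sentiment based on reviews?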
import csv
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import pyplot as pyplt
import numpy as np
import matplotlib
import codecs
import json
import requests
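from IPython.display import display

def describe_dataset(df):
    """Print the retailer name and row count, then show the first row."""
    print(df.iloc[0]["retailer"])
    print("Number of rows: %d" % (df.shape[0]))
    display(df.head(1))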
pd.set_option("display.max_rows", 101)
df_list=[]
aerie_df = pd.read_csv("./input/ae_com.csv")
aerie_df['ec_site'] = 'aerie'
aerie_df['official'] = '1'
df_list.append(aerie_df)
calvin_klein_df = pd.read_csv("./input/calvinklein_com.csv")
calvin_klein_df['ec_site'] = 'calvin_klein'
calvin_klein_df['official'] = '1'
df_list.append(calvin_klein_df)
amazon_df = pd.read_csv("./input/amazon_com.csv")
amazon_df['ec_site'] = 'amazon'
amazon_df['official'] = '0'
df_list.append(amazon_df)
btempted_df = pd.read_csv("./input/btemptd_com.csv")
btempted_df['ec_site'] = 'wacoal'
btempted_df['official'] = '1'
df_list.append(btempted_df)
hanky_panky_df = pd.read_csv("./input/hankypanky_com.csv")
hanky_panky_df['ec_site'] = 'hanky_panky'
hanky_panky_df['official'] = '1'
df_list.append(hanky_panky_df)
macys_df = pd.read_csv("./input/macys_com.csv")
macys_df['ec_site'] = 'macys'
macys_df['official'] = '0'
df_list.append(macys_df)
nordstrom_df = pd.read_csv("./input/shop_nordstrom_com.csv")
nordstrom_df['ec_site'] = 'nordstrom'
nordstrom_df['official'] = '1'
df_list.append(nordstrom_df)
topshop_df = pd.read_csv("./input/us_topshop_com.csv")
topshop_df['ec_site'] = 'topshop'
topshop_df['official'] = '1'
df_list.append(topshop_df)
victoriassecret_df = pd.read_csv("./input/victoriassecret_com.csv")
victoriassecret_df['ec_site'] = 'victoriassecret'
victoriassecret_df['official'] = '1'
df_list.append(victoriassecret_df)
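# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames instead.
# Each file's original row labels are kept, so labels repeat across retailers.
merged_df = pd.concat(df_list)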
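The nine load-and-tag steps above all follow the same pattern, so they can be sketched as one helper. This is a minimal illustration, not the notebook's code: `tag_and_concat` and the tiny in-memory frames are hypothetical, and unlike the cell above it uses `ignore_index=True`, which gives every merged row a unique label.

```python
import pandas as pd

def tag_and_concat(frames):
    """Tag each frame with its site and official flag, then stack them.

    `frames` is a list of (DataFrame, ec_site, official) tuples.
    """
    tagged = []
    for df, ec_site, official in frames:
        # assign() returns a copy, so the caller's frames are left untouched.
        tagged.append(df.assign(ec_site=ec_site, official=official))
    # ignore_index=True relabels the rows 0..n-1 in the result.
    return pd.concat(tagged, ignore_index=True)

# Hypothetical stand-ins for two of the CSV files.
demo_a = pd.DataFrame({"product_name": ["a1", "a2"], "mrp": ["$10", "$12"]})
demo_b = pd.DataFrame({"product_name": ["b1"], "mrp": ["$30"]})
demo_merged = tag_and_concat([(demo_a, "aerie", "1"), (demo_b, "amazon", "0")])
print(demo_merged)
```

With real data, each tuple would hold the result of `pd.read_csv(...)` for one retailer file.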
merged_df.columns
Index(['product_name', 'mrp', 'price', 'pdp_url', 'brand_name',
'product_category', 'retailer', 'description', 'rating', 'review_count',
'style_attributes', 'total_sizes', 'available_size', 'color', 'ec_site',
'official'],
dtype='object')
for df in df_list:
    describe_dataset(df)
Ae US
Number of rows: 28328
| | product_name | mrp | price | pdp_url | brand_name | product_category | retailer | description | rating | review_count | style_attributes | total_sizes | available_size | color | ec_site | official |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aerie Everyday Loves Lace Cheeky | 12.50 USD | 12.50 USD | https://www.ae.com/aerie-everyday-loves-lace-c... | AERIE | Cheekies | Ae US | Introducing Everyday Loves™: Made with love. E... | 5.0 | 8.0 | ["Soft lace with the right amount of stretch",... | ["XS", "S", "M", "L", "XL", "XXL"] | ["XS", "S", "M", "L", "XL", "XXL"] | Rugged Green | aerie | 1 |
Calvin Klein US
Number of rows: 4747
| | product_name | mrp | price | pdp_url | brand_name | product_category | retailer | description | rating | review_count | style_attributes | total_sizes | available_size | color | ec_site | official |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | logo cotton stretch thong | $13.00 | $13.00 | http://www.calvinklein.us/en/womens-clothing/w... | Calvin Klein | 3 FOR 33 PANTY ESSENTIALS | Calvin Klein US | soft cotton stretch fabric and a metallic logo... | NaN | NaN | ["cotton stretch thong panty","metallic elasti... | s,m,l | s,m,l | BLACK | calvin_klein | 1 |
Amazon US
Number of rows: 31612
| | product_name | mrp | price | pdp_url | brand_name | product_category | retailer | description | rating | review_count | style_attributes | total_sizes | available_size | color | ec_site | official |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Calvin Klein Women's Sheer Marquisette Demi Un... | $36.00 | $32.40 | https://www.amazon.com/-/dp/B01NAVD98J?th=1&psc=1 | Calvin-Klein | Bras | Amazon US | An unlined demi cup bra featuring sheer, sexy ... | 4.5 | 47 | [ 72% Nylon, 28% Elastane , Imported , hook an... | 30B , 30C , 30D , 30DD , 32A , 32B , 32C , 32D... | 30B , 30C , 30D , 30DD , 32B , 32C , 32D , 32D... | Bare | amazon | 0 |
Btemptd US
Number of rows: 3518
| | product_name | mrp | price | pdp_url | brand_name | product_category | retailer | description | rating | review_count | style_attributes | total_sizes | available_size | color | ec_site | official |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b.tempt'd Ciao Bella Bralette | $30.00 | $30.00 | http://btemptd.wacoal-america.com/b-tempt-d-ci... | WACOAL | COLLECTIONS | Btemptd US | Say “buongiorno!” to this ladylike piece that ... | NaN | NaN | [Wire free bralette • Cut and sew corded lace ... | XS,S,M,L,XL | l,m,s,xl,xs | Bridal White | wacoal | 1 |
Hankypanky US
Number of rows: 35005
| | product_name | mrp | price | pdp_url | brand_name | product_category | retailer | description | rating | review_count | style_attributes | total_sizes | available_size | color | ec_site | official |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Silky 20” A-line Half Slip with Lace | $68 | $68 | http://www.hankypanky.com/collections/silky-20... | HankyPanky | Collections | Hankypanky US | Hanky Panky Silky is the ideal fabric for unde... | NaN | NaN | ["Just above-the-knee-length half-slip with el... | ["Select", "S", "M", "L", "XL"] | ["Select", "S", "M", "L", "XL"] | Black | hanky_panky | 1 |
Macys US
Number of rows: 40897
| | product_name | mrp | price | pdp_url | brand_name | product_category | retailer | description | rating | review_count | style_attributes | total_sizes | available_size | color | ec_site | official |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ID String Bikini QF1754 | $20.00 | $20.00 | http://www1.macys.com/shop/product/calvin-klei... | Calvin Klein | Women - Lingerie & Shapewear - Designer Lingerie | Macys US | The perfect amount of coverage in a subtle sil... | NaN | NaN | ["Thin elastic waistband ", "Repeating logo at... | ["XS", "S", "M", "L", "XL"] | ["XS", "S", "M", "L", "XL"] | Black | macys | 0 |
Nordstrom US
Number of rows: 12568
| | product_name | mrp | price | pdp_url | brand_name | product_category | retailer | description | rating | review_count | style_attributes | total_sizes | available_size | color | ec_site | official |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 'B Fitting' High Cut Briefs | $15.00 | $15.00 | http://shop.nordstrom.com/s/wacoal-b-fitting-h... | WACOAL | Women's Panties | Nordstrom US | Lighter-than-air, full-cut Supima® cotton brie... | 4.2 | 65.0 | ["79% Supima® cotton, 21% spandex.", "Hand was... | [nil] | [nil] | Black | nordstrom | 1 |
Topshop US
Number of rows: 3082
| | product_name | mrp | price | pdp_url | brand_name | product_category | retailer | description | rating | review_count | style_attributes | total_sizes | available_size | color | ec_site | official |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MATERNITY Criss Cross Knickers | $20.00 | $20.00 | http://www.topshop.com/en/tsus/product/clothin... | US TOPSHOP | Lingerie | Topshop US | These feminine black knickers for maternity fe... | NaN | NaN | NaN | 4,6,8,10,12 | 4,6,8,10,12 | BLACK | topshop | 1 |
Victoriassecret US
Number of rows: 453386
| | product_name | mrp | price | pdp_url | brand_name | product_category | retailer | description | rating | review_count | style_attributes | total_sizes | available_size | color | ec_site | official |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Very Sexy Strappy Lace Thong Panty | $14.50 | $14.50 | https://www.victoriassecret.com/panties/shop-a... | Victoria's Secret | Strappy Lace Thong Panty | Victoriassecret US | Lots of cheek peek, pretty lace, a strappy bac... | NaN | NaN | NaN | ["XS", "S", "M", "L", "XL"] | S | peach melba | victoriassecret | 1 |
・mrp : maximum retail price
・price : selling price
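Given these two fields, the discount questions listed earlier come down to one ratio per product. The sketch below is only an illustration of that arithmetic: `discount_rate` is a hypothetical helper, not part of the notebook, and it assumes both values are already numeric and in the same currency. The sample numbers are the Amazon row shown above (mrp $36.00, price $32.40).

```python
def discount_rate(mrp, price):
    """Fraction taken off the maximum retail price; None when it cannot be computed."""
    if mrp is None or price is None or mrp <= 0:
        return None
    return (mrp - price) / mrp

# Amazon sample row above: mrp 36.00, price 32.40 (about 10% off).
print(discount_rate(36.00, 32.40))
```

Averaging this ratio per brand, or per brand and product category, would be one way to rank brands by discount.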
import re

# Approximate conversion rates to USD for prices listed in colón (₡) or rupiah (rp).
colon_to_dollar_conversion_rate = 0.0017
ind_rp_to_dollar_conversion_rate = 0.000066

def extract_usd_value(value_str):
    """Strip currency markers from a price string; non-string values pass through unchanged."""
    if type(value_str) is str:
        value_str = value_str.strip().lower().replace('usd', '').replace('$', '').strip()
        # For a price range such as "10-20" or "10–20", keep only the lower bound.
        value_str = re.sub(r"-.*", "", value_str)
        value_str = re.sub(r"–.*", "", value_str)
        # Drop everything after the first whitespace.
        value_str = re.sub(r"\s.*", "", value_str)
        value_str = value_str.strip()
        if "₡" in value_str:
            value_str = value_str.replace("₡", "").strip()
            value_str = pd.to_numeric(value_str) * colon_to_dollar_conversion_rate
        elif "rp" in value_str:
            value_str = value_str.replace("rp", "").strip()
            value_str = pd.to_numeric(value_str) * ind_rp_to_dollar_conversion_rate
        return value_str
    else:
        return value_str
merged_df['mrp']=merged_df['mrp'].apply(extract_usd_value).apply(pd.to_numeric)
merged_df['price']=merged_df['price'].apply(extract_usd_value).apply(pd.to_numeric)
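To make the cleaning step concrete, here is a simplified, standalone parser applied to the price formats visible in the sample rows ("12.50 USD", "$36.00", "$68"). It is a sketch under assumptions, not the notebook's function: `parse_usd` is hypothetical, it takes the first number in the string, and it does no currency conversion. The range string is an invented example.

```python
import re

def parse_usd(text):
    """Return the first number found in a price string as a float, or None."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None

# The last entry is a made-up range; the lower bound is what gets kept.
for raw in ["12.50 USD", "$36.00", "$68", "$10.00 - $20.00"]:
    print(raw, "->", parse_usd(raw))
```

Taking the first number mirrors the notebook's choice of keeping the lower bound of a price range.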
pd.value_counts(merged_df['brand_name']).plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x1a15af56a0>
def standardize_brand_names(brand_name):
    """Map the spelling variants of each brand to one lower-case name."""
    brand_name = brand_name.lower()
    brand_name = brand_name.replace("-", " ")
    if "hanky" in brand_name:
        brand_name = "hanky panky"
    elif "calvin" in brand_name:
        brand_name = "calvin klein"
    elif "wacoal" in brand_name or "tempt" in brand_name:
        brand_name = "b.tempt'd"
    elif "victorias" in brand_name:
        brand_name = "victoria's secret"
    elif "aeo" in brand_name:
        brand_name = "aerie"
    brand_name = brand_name.strip()
    return brand_name
merged_df['brand_name'] = merged_df['brand_name'].apply(standardize_brand_names)
pd.value_counts(merged_df['brand_name']).plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x1a22051a20>
pd.value_counts(merged_df['brand_name'])
victoria's secret 342600
victoria's secret pink 110853
hanky panky 48302
b.tempt'd 45273
calvin klein 31251
aerie 28328
us topshop 3082
vanity fair 2575
nordstrom lingerie 870
s 3
nintendo 1
compression comfort 1
creative motion 1
fila 1
sexy hair 1
lucky brand 1
Name: brand_name, dtype: int64
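Nintendo, Sexy Hair, and S are odd brand names; they probably come from the Amazon dataset.
The Victoria's Secret dataset is far larger than those of the other brands.
However, every dataset was extracted from the individual retailer's web pages within a fixed period.
The larger dataset can therefore be explained as follows.
・It has a larger inventory.
・Amazon sells its own brands.
・Brands with few records (fewer than 100) are not informative, so they are removed.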
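MIN_RECORDS = 100
# Series.iteritems() was removed in pandas 2.0; items() is the replacement.
for brand, count in merged_df['brand_name'].value_counts().items():
    if count < MIN_RECORDS:
        # drop() removes rows by index label. The merged frame keeps each file's
        # original labels, so rows of other brands that share a label are removed too.
        merged_df.drop(merged_df[merged_df.brand_name == brand].index, inplace=True)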
pd.value_counts(merged_df['brand_name'])
victoria's secret 342594
victoria's secret pink 110850
hanky panky 48286
b.tempt'd 45271
calvin klein 31246
aerie 28322
us topshop 3081
vanity fair 2574
nordstrom lingerie 870
Name: brand_name, dtype: int64
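Every remaining brand lost a few rows between the two listings (for example, victoria's secret went from 342600 to 342594). That is consistent with dropping by index label on a frame whose labels repeat across source files. The toy frames below are hypothetical and only illustrate the effect, alongside a boolean-mask filter that removes just the rare brands.

```python
import pandas as pd

# Two hypothetical retailer frames; both carry row labels 0 and 1.
site_a = pd.DataFrame({"brand_name": ["big", "big"]})
site_b = pd.DataFrame({"brand_name": ["rare", "big"]})
stacked = pd.concat([site_a, site_b])  # labels are 0, 1, 0, 1

# Label-based drop: removes every row labelled 0, including one "big" row.
rare_labels = stacked[stacked.brand_name == "rare"].index
by_label = stacked.drop(rare_labels)

# Mask-based filter: keeps exactly the rows whose brand is frequent enough.
MIN_RECORDS = 2
counts = stacked["brand_name"].value_counts()
frequent = counts[counts >= MIN_RECORDS].index
by_mask = stacked[stacked["brand_name"].isin(frequent)]

print(len(by_label), len(by_mask))
```

Resetting the index after merging (for example with `reset_index(drop=True)`) would be another way to avoid the repeated labels.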
product_categories = pd.value_counts(merged_df['product_category'])
print("There are %d categories! "% (product_categories.shape[0]))
There are 525 categories!
product_categories.plot.pie(figsize=(10,10))
<matplotlib.axes._subplots.AxesSubplot at 0x1a19da2630>
categories = {
"bra" : ["bra", "push-up", "classic", "tomgirl", "collections", "longline", "bridget", "hannah", "audrey", "lorna jane", "katie", "brooke", "push", "padded", "demi", "scoop", "full coverage", "wireless", "plunge"],
"panty" : ["panty", "panties", "brief", "hiphugger","cheekies", "thong", "hipster", "cheekster", "short", "bottom", "undies"],
"bikini" : ["bikini", "triangle", "one-piece", "one piece", "high-neck", "hineck"],
"top" : ["tee", "top", "tank", "halter", "bandeau", "racerback", "cami", "crop"],
"lingerie": ["slip", "garter", "babydoll", "lingerie", "teddy", "sleepwear", "fishnet", "robe", "kimono", "bodysuit", "romper", "tunic"],
"shapewear": ["shapewear", "bustier"],
"socks" : ["sock"],
"leggings": ["legging"],
"bottle": ["bottle"],
"hoodie" : ["full-zip"],
"kit" : ["kit", "duffle"],
"petal": ["petal"]
}
def standardize_product_category(row):
    """Map a row to a product group: match on the category first, then on the product name."""
    product_category = row["product_category"]
    product_name = row["product_name"]
    product_name = product_name.lower()
    product_category = product_category.lower()
    for group, items in categories.items():
        for item in items:
            if item in product_category:
                return group
    for group, items in categories.items():
        for item in items:
            if item in product_name:
                return group
    # No keyword matched: fall back to the lower-cased category.
    return product_category
merged_df["product_group"] = merged_df.apply(standardize_product_category, axis=1)
product_group = pd.value_counts(merged_df['product_group'])
display(product_group)
print("There are %d categories! "% (product_group.shape[0]))
product_group.plot.pie(figsize=(10,10))
bra 436367
panty 124253
lingerie 39247
top 6077
bikini 5775
shapewear 1019
hoodie 93
socks 82
leggings 66
bottle 39
baseball hat 27
petal 22
kit 18
washed canvas tote 9
Name: product_group, dtype: int64
There are 14 categories!
<matplotlib.axes._subplots.AxesSubplot at 0x1a184a01d0>
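The grouping above returns the first group whose keyword occurs anywhere in the text, so both the order of the groups and plain substring matching affect the result. The sketch below uses a trimmed, illustrative copy of the keyword table and a hypothetical helper `first_matching_group`. It relies on dictionaries keeping insertion order (Python 3.7+). The product names are invented examples.

```python
# A trimmed copy of the keyword table, for illustration only.
demo_categories = {
    "bra": ["bra", "push-up"],
    "panty": ["panty", "thong", "short"],
    "bikini": ["bikini"],
}

def first_matching_group(text, table):
    """Return the first group whose keyword is a substring of `text`, else None."""
    text = text.lower()
    for group, keywords in table.items():
        for keyword in keywords:
            if keyword in text:
                return group
    return None

for name in ["String Bikini", "Bikini Panty", "Zebra Print Thong", "Socks"]:
    print(name, "->", first_matching_group(name, demo_categories))
```

"Bikini Panty" lands in panty because that group is checked before bikini, and "Zebra Print Thong" lands in bra because "bra" is a substring of "zebra". Matching on whole words instead would avoid the second case.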
merged_df['product_group'].value_counts(dropna=False, normalize=True)
bra 0.711746
panty 0.202665
lingerie 0.064015
top 0.009912
bikini 0.009419
shapewear 0.001662
hoodie 0.000152
socks 0.000134
leggings 0.000108
bottle 0.000064
baseball hat 0.000044
petal 0.000036
kit 0.000029
washed canvas tote 0.000015
Name: product_group, dtype: float64
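・The top 5 product categories account for 99% of the records.
・Categories with few records (fewer than 1,000) are not informative, so they are removed.
| Rank | English name | Japanese name | Count | Share | Cumulative share |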
|---|---|---|---|---|---|
| 1 | bra | ブラ | 436,394 | 0.71 | 0.71 |
| 2 | panty | パンティ | 124,270 | 0.20 | 0.91 |
| 3 | lingerie | ランジェリー | 39,252 | 0.06 | 0.97 |
| 4 | top | トップ | 6,077 | 0.01 | 0.98 |
| 5 | bikini | ビキニ | 5,775 | 0.01 | 0.99 |
| 6 | shapewear | 矯正下着 | 1,019 | 0.00 | 0.99 |
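The share and cumulative columns can be recomputed from the counts. The sketch below copies the counts from the `product_group` listing printed earlier (these differ slightly from the counts typed into the table) and accumulates the shares in rank order. The variable names are illustrative.

```python
from itertools import accumulate

# Counts copied from the product_group listing printed earlier.
group_counts = {
    "bra": 436367, "panty": 124253, "lingerie": 39247, "top": 6077,
    "bikini": 5775, "shapewear": 1019, "hoodie": 93, "socks": 82,
    "leggings": 66, "bottle": 39, "baseball hat": 27, "petal": 22,
    "kit": 18, "washed canvas tote": 9,
}

total = sum(group_counts.values())
shares = [count / total for count in group_counts.values()]
cumulative = list(accumulate(shares))  # running total of the shares

for (group, count), share, cum in zip(group_counts.items(), shares, cumulative):
    print(f"{group:20s} {count:7d} {share:.4f} {cum:.4f}")
```

On a DataFrame, `value_counts(normalize=True).cumsum()` on the `product_group` column should give the same running total.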
MIN_RECORDS = 1000
# Series.iteritems() was removed in pandas 2.0; items() is the replacement.
for category, count in product_group.items():
    if count < MIN_RECORDS:
        merged_df.drop(merged_df[merged_df.product_group == category].index, inplace=True)
merged_df["product_group"] = merged_df.apply(standardize_product_category, axis=1)
product_group = pd.value_counts(merged_df['product_group'])
display(product_group)
print("There are %d categories! "% (product_group.shape[0]))
product_group.plot.pie(figsize=(10,10))
bra 436140
panty 124123
lingerie 39141
top 6069
bikini 5768
shapewear 1016
Name: product_group, dtype: int64
There are 6 categories!
<matplotlib.axes._subplots.AxesSubplot at 0x1a183fad30>
merged_df_official = merged_df[merged_df['official'].isin(['1'])]
merged_df_not_official = merged_df[merged_df['official'].isin(['0'])]
import gensim
import numpy as np
from collections import Counter
from sklearn import datasets
import matplotlib.pyplot as plt
from wordcloud import WordCloud
with open("stopwords.txt") as f:
    stopwords = f.read()
stopwords = stopwords.split()
docs = merged_df["product_name"]
texts = [
[w for w in doc.lower().split() if w not in stopwords]
for doc in docs
]
count = Counter(w for doc in texts for w in doc)
y = [i[1] for i in count.most_common()]
dictionary = gensim.corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
corpus[0]
num_topics = 12
lda = gensim.models.ldamodel.LdaModel(
corpus=corpus,
num_topics=num_topics,
id2word=dictionary
)
plt.figure(figsize=(30,30))
for t in range(lda.num_topics):
    plt.subplot(5,4,t+1)
    x = dict(lda.show_topic(t,200))
    im = WordCloud().generate_from_frequencies(x)
    plt.imshow(im)
    plt.axis("off")
    plt.title("Topic #" + str(t))
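The tokenisation at the start of this cell is repeated for every brand and text column, so it can be sketched once as a helper. This is a hypothetical helper, not the notebook's code. It assumes that some text cells may be missing (NaN) and skips them, since calling `lower()` on a float would fail. It covers only the tokenising and counting, not the LDA model or the word clouds.

```python
from collections import Counter

def tokenize(docs, stopwords):
    """Lower-case and whitespace-split each document, skipping non-string entries."""
    stop = set(stopwords)  # set membership is faster than scanning a list
    return [
        [w for w in doc.lower().split() if w not in stop]
        for doc in docs
        if isinstance(doc, str)
    ]

# Hypothetical documents; the middle entry stands in for a missing value.
demo_docs = ["Lace Thong Panty", float("nan"), "The Lace Bralette"]
demo_texts = tokenize(demo_docs, ["the"])
demo_counts = Counter(w for text in demo_texts for w in text)
print(demo_counts.most_common(1))
```

The returned token lists have the same shape as `texts`, so they could be passed to `gensim.corpora.Dictionary` in the same way.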
merged_df_victoria= merged_df[merged_df['brand_name'].str.startswith('victoria')]
docs = merged_df_victoria["product_name"]
texts = [
[w for w in doc.lower().split() if w not in stopwords]
for doc in docs
]
count = Counter(w for doc in texts for w in doc)
y = [i[1] for i in count.most_common()]
dictionary = gensim.corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
corpus[0]
num_topics = 12
lda = gensim.models.ldamodel.LdaModel(
corpus=corpus,
num_topics=num_topics,
id2word=dictionary
)
plt.figure(figsize=(30,30))
for t in range(lda.num_topics):
    plt.subplot(5,4,t+1)
    x = dict(lda.show_topic(t,200))
    im = WordCloud().generate_from_frequencies(x)
    plt.imshow(im)
    plt.axis("off")
    plt.title("Topic #" + str(t))
merged_df_hanky_panky= merged_df[merged_df['brand_name'].str.startswith('hanky panky')]
docs = merged_df_hanky_panky["product_name"]
texts = [
[w for w in doc.lower().split() if w not in stopwords]
for doc in docs
]
count = Counter(w for doc in texts for w in doc)
y = [i[1] for i in count.most_common()]
dictionary = gensim.corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
corpus[0]
num_topics = 12
lda = gensim.models.ldamodel.LdaModel(
corpus=corpus,
num_topics=num_topics,
id2word=dictionary
)
plt.figure(figsize=(30,30))
for t in range(lda.num_topics):
    plt.subplot(5,4,t+1)
    x = dict(lda.show_topic(t,200))
    im = WordCloud().generate_from_frequencies(x)
    plt.imshow(im)
    plt.axis("off")
    plt.title("Topic #" + str(t))
merged_df_wacoal= merged_df[merged_df['brand_name'].str.startswith('b.tempt')]
docs = merged_df_wacoal["product_name"]
texts = [
[w for w in doc.lower().split() if w not in stopwords]
for doc in docs
]
count = Counter(w for doc in texts for w in doc)
y = [i[1] for i in count.most_common()]
dictionary = gensim.corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
corpus[0]
num_topics = 12
lda = gensim.models.ldamodel.LdaModel(
corpus=corpus,
num_topics=num_topics,
id2word=dictionary
)
plt.figure(figsize=(30,30))
for t in range(lda.num_topics):
    plt.subplot(5,4,t+1)
    x = dict(lda.show_topic(t,200))
    im = WordCloud().generate_from_frequencies(x)
    plt.imshow(im)
    plt.axis("off")
    plt.title("Topic #" + str(t))
merged_df_calvin= merged_df[merged_df['brand_name'].str.startswith('calvin')]
docs = merged_df_calvin["product_name"]
texts = [
[w for w in doc.lower().split() if w not in stopwords]
for doc in docs
]
count = Counter(w for doc in texts for w in doc)
y = [i[1] for i in count.most_common()]
dictionary = gensim.corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
corpus[0]
num_topics = 12
lda = gensim.models.ldamodel.LdaModel(
corpus=corpus,
num_topics=num_topics,
id2word=dictionary
)
plt.figure(figsize=(30,30))
for t in range(lda.num_topics):
    plt.subplot(5,4,t+1)
    x = dict(lda.show_topic(t,200))
    im = WordCloud().generate_from_frequencies(x)
    plt.imshow(im)
    plt.axis("off")
    plt.title("Topic #" + str(t))
merged_df_aerie= merged_df[merged_df['brand_name'].str.startswith('aerie')]
docs = merged_df_aerie["product_name"]
texts = [
[w for w in doc.lower().split() if w not in stopwords]
for doc in docs
]
count = Counter(w for doc in texts for w in doc)
y = [i[1] for i in count.most_common()]
dictionary = gensim.corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
corpus[0]
num_topics = 12
lda = gensim.models.ldamodel.LdaModel(
corpus=corpus,
num_topics=num_topics,
id2word=dictionary
)
plt.figure(figsize=(30,30))
for t in range(lda.num_topics):
    plt.subplot(5,4,t+1)
    x = dict(lda.show_topic(t,200))
    im = WordCloud().generate_from_frequencies(x)
    plt.imshow(im)
    plt.axis("off")
    plt.title("Topic #" + str(t))
docs = merged_df["description"]
texts = [
[w for w in doc.lower().split() if w not in stopwords]
for doc in docs
]
count = Counter(w for doc in texts for w in doc)
y = [i[1] for i in count.most_common()]
dictionary = gensim.corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
corpus[0]
num_topics = 12
lda = gensim.models.ldamodel.LdaModel(
corpus=corpus,
num_topics=num_topics,
id2word=dictionary
)
plt.figure(figsize=(30,30))
for t in range(lda.num_topics):
    plt.subplot(5,4,t+1)
    x = dict(lda.show_topic(t,200))
    im = WordCloud().generate_from_frequencies(x)
    plt.imshow(im)
    plt.axis("off")
    plt.title("Topic #" + str(t))
docs = merged_df_victoria["description"]
texts = [
[w for w in doc.lower().split() if w not in stopwords]
for doc in docs
]
count = Counter(w for doc in texts for w in doc)
y = [i[1] for i in count.most_common()]
dictionary = gensim.corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
corpus[0]
num_topics = 12
lda = gensim.models.ldamodel.LdaModel(
corpus=corpus,
num_topics=num_topics,
id2word=dictionary
)
plt.figure(figsize=(30,30))
for t in range(lda.num_topics):
    plt.subplot(5,4,t+1)
    x = dict(lda.show_topic(t,200))
    im = WordCloud().generate_from_frequencies(x)
    plt.imshow(im)
    plt.axis("off")
    plt.title("Topic #" + str(t))
docs = merged_df_hanky_panky["description"]
texts = [
[w for w in doc.lower().split() if w not in stopwords]
for doc in docs
]
count = Counter(w for doc in texts for w in doc)
y = [i[1] for i in count.most_common()]
dictionary = gensim.corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
corpus[0]
num_topics = 12
lda = gensim.models.ldamodel.LdaModel(
corpus=corpus,
num_topics=num_topics,
id2word=dictionary
)
plt.figure(figsize=(30,30))
for t in range(lda.num_topics):
    plt.subplot(5,4,t+1)
    x = dict(lda.show_topic(t,200))
    im = WordCloud().generate_from_frequencies(x)
    plt.imshow(im)
    plt.axis("off")
    plt.title("Topic #" + str(t))
merged_df_wacoal= merged_df[merged_df['brand_name'].str.startswith('b.tempt')]
docs = merged_df_wacoal["description"]
texts = [
[w for w in doc.lower().split() if w not in stopwords]
for doc in docs
]
count = Counter(w for doc in texts for w in doc)
y = [i[1] for i in count.most_common()]
dictionary = gensim.corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
corpus[0]
num_topics = 12
lda = gensim.models.ldamodel.LdaModel(
corpus=corpus,
num_topics=num_topics,
id2word=dictionary
)
plt.figure(figsize=(30,30))
for t in range(lda.num_topics):
    plt.subplot(5,4,t+1)
    x = dict(lda.show_topic(t,200))
    im = WordCloud().generate_from_frequencies(x)
    plt.imshow(im)
    plt.axis("off")
    plt.title("Topic #" + str(t))
docs = merged_df_calvin["description"]
texts = [
[w for w in doc.lower().split() if w not in stopwords]
for doc in docs
]
count = Counter(w for doc in texts for w in doc)
y = [i[1] for i in count.most_common()]
dictionary = gensim.corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
corpus[0]
num_topics = 12
lda = gensim.models.ldamodel.LdaModel(
corpus=corpus,
num_topics=num_topics,
id2word=dictionary
)
plt.figure(figsize=(30,30))
for t in range(lda.num_topics):
    plt.subplot(5,4,t+1)
    x = dict(lda.show_topic(t,200))
    im = WordCloud().generate_from_frequencies(x)
    plt.imshow(im)
    plt.axis("off")
    plt.title("Topic #" + str(t))
docs = merged_df_aerie["description"]
texts = [
[w for w in doc.lower().split() if w not in stopwords]
for doc in docs
]
count = Counter(w for doc in texts for w in doc)
y = [i[1] for i in count.most_common()]
dictionary = gensim.corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
corpus[0]
num_topics = 12
lda = gensim.models.ldamodel.LdaModel(
corpus=corpus,
num_topics=num_topics,
id2word=dictionary
)
plt.figure(figsize=(30,30))
for t in range(lda.num_topics):
    plt.subplot(5,4,t+1)
    x = dict(lda.show_topic(t,200))
    im = WordCloud().generate_from_frequencies(x)
    plt.imshow(im)
    plt.axis("off")
    plt.title("Topic #" + str(t))
plt.rcParams["figure.figsize"] = (10, 5)
plt.rcParams["font.size"] = 18
sns.boxplot( x=merged_df["mrp"], y=merged_df["brand_name"], palette = "Pastel2").set_title('Maximum retail price (MRP) range by brand')
Text(0.5, 1.0, 'Maximum retail price (MRP) range by brand')
sns.boxplot( x=merged_df_official["mrp"], y=merged_df_official["brand_name"], palette = "Pastel2").set_title('Official sites: maximum retail price (MRP) range by brand')
Text(0.5, 1.0, 'Official sites: maximum retail price (MRP) range by brand')
sns.boxplot( x=merged_df_not_official["mrp"], y=merged_df_not_official["brand_name"], palette = "Pastel2").set_title('Non-official sites: maximum retail price (MRP) range by brand')
Text(0.5, 1.0, 'Non-official sites: maximum retail price (MRP) range by brand')
sns.boxplot( x=merged_df_not_official["mrp"], y=merged_df_not_official["ec_site"], palette = "Pastel2").set_title('Maximum retail price (MRP) range by e-commerce site')
Text(0.5, 1.0, 'Maximum retail price (MRP) range by e-commerce site')
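・Vanity Fair is the most affordable brand.
・Most underwear has an average price of about 30-40 USD and can be bought for 200 USD or less.
・Hanky Panky has the most expensive item, at 600 USD.
・High-priced items are sold on official sites.
・Amazon does not carry high-priced items.
・Macy's is a department store and so carries high-priced items, but not items as expensive as those on official sites.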
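The observations above are read off the box plots; a per-brand numeric summary is one way to cross-check them. The frame below is a hypothetical stand-in for `merged_df` with invented prices (only the 600 USD maximum echoes the text), and the column names follow the notebook.

```python
import pandas as pd

# Hypothetical miniature of merged_df; the prices are invented for illustration.
demo = pd.DataFrame({
    "brand_name": ["vanity fair", "vanity fair", "hanky panky", "hanky panky"],
    "mrp": [10.0, 14.0, 30.0, 600.0],
})

# Minimum, median and maximum MRP for each brand.
summary = demo.groupby("brand_name")["mrp"].agg(["min", "median", "max"])
print(summary)
```

Running the same `groupby` on `merged_df`, or grouping by `ec_site` instead, would give the figures behind each box plot.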
merged_df[merged_df["mrp"] > 400]['product_name'].value_counts()
30 Pack Low Rise Thongs in Lucite Jar 25
30 Pack Original Rise Thongs in Lucite Jar 24
Name: product_name, dtype: int64
merged_df[merged_df["mrp"] > 400]['product_category'].value_counts()
Thongs 27
Collections 22
Name: product_category, dtype: int64
merged_df[merged_df["mrp"] > 400]['description'].value_counts()
For the ultimate Hanky Panky collector! This limited edition jar opens to 30 of your favorite low rise signature lace thongs. Perfect for gifting or treating yourself! 25
For the ultimate Hanky Panky collector! This limited edition jar opens to 30 of your favorite original rise signature lace thongs. Perfect for gifting or treating yourself! 24
Name: description, dtype: int64
merged_df[merged_df["mrp"] > 400]['style_attributes'].value_counts()
["One-size thong in our signature stretch lace ", " Fits Sizes 2-12 best (hips measuring 35\"-42\")", " Low Rise fits lower on the hips ", " The World’s Most Comfortable Thong® ", " Hanky Panky’s revolutionary and flattering V-front, V-back waistband—our signature design for over 25 years", " No visible panty line (VPL)", " Set includes the following colors: Vivid Coral, Ballet Pink, Bliss Pink, Chai, Vanilla, Buttercream, Pistachio Ice, Powder Blue, Forget Me Not, Dove Grey, Red, Sizzle Pink, Peach Smoothie, Lemoncello, Lime Sherbert, Periwinkle, True Blue, Bali Blue, Violet Crush, Hot Lilac, Enchanted Rose, Passionate Pink, Salsa, Sunkissed, Sweet Mint, Calypso, Sapphire, Electric Orchid, Electric Purple and Granite", " Body: 100% Nylon; Trim: 90% Nylon, 10% Spandex; Crotch Lining: 100% Supima® Cotton", " Made in the USA"] 25
["One-size thong in our signature stretch lace", " Fits sizes 4-14 best (hips measuring 36\"-45\") ", " Original Rise fits higher on the hips ", " The World’s Most Comfortable Thong® ", " Hanky Panky’s revolutionary and flattering V-front, V-back waistband—our signature design for over 25 years", " No visible panty line (VPL)", " Set includes the following colors: Vivid Coral, Ballet Pink, Bliss Pink, Chai, Vanilla, Buttercream, Pistachio Ice, Powder Blue, Forget Me Not, Dove Grey, Red, Sizzle Pink, Peach Smoothie, Lemoncello, Lime Sherbert, Periwinkle, True Blue, Bali Blue, Violet Crush, Hot Lilac, Enchanted Rose, Passionate Pink, Salsa, Sunkissed, Sweet Mint, Calypso, Sapphire, Electric Orchid, Electric Purple and Granite", " Body: 100% Nylon; Trim: 90% Nylon, 10% Spandex; Crotch Lining: 100% Supima® Cotton", " Made in the USA"] 24
Name: style_attributes, dtype: int64
merged_df[merged_df["mrp"] > 400]['total_sizes'].value_counts()
["One Size"] 41
["Select", "One Size"] 8
Name: total_sizes, dtype: int64
merged_df[merged_df["mrp"] > 400]['available_size'].value_counts()
["One Size"] 41
["Select", "One Size"] 8
Name: available_size, dtype: int64
merged_df[merged_df["mrp"] > 400]['color'].value_counts()
Rainbow 49
Name: color, dtype: int64
merged_df[merged_df["mrp"] > 400]['product_group'].value_counts()
panty 27
bra 22
Name: product_group, dtype: int64
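・The high-priced items are sold as sets.
・The color is listed as Rainbow; the set is sold as a gift, arranged like a bouquet.
・A man buying a gift for a woman may feel embarrassed in a physical store, but not online.
・The one-size format makes for a well-designed product.
・With some thought about how it is presented, the gift need not come across as indecent.
・Have Americans recently grown tired of sexy-style underwear?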
plt.rcParams["figure.figsize"] = (10, 5)
product_group = merged_df['product_group'].value_counts()
# One box plot per product group, comparing MRP across brands.
for category in product_group.index:
    product_group_df = merged_df[merged_df["product_group"] == category]
    plt.figure()
    title = "MRP range across brands for " + category
    sns.boxplot( x=product_group_df["mrp"], y=product_group_df["brand_name"], palette = "Pastel2").set_title(title)
・Many brands compete in bras and panties.
pd.value_counts(merged_df["color"])
Black 36913
White 18068
black 13121
pure black 7083
Ensign 6738
Almost Nude 6271
bayberry 5438
Hello Lovely 5159
Sheer Pink 4985
Radiating Aztec 4861
Black Marl 4735
buff 4554
Burnished Lilac 4518
True Black 4385
triumph white 4261
Naturally Nude 4027
Ivory 3520
coconut white 3488
Black Lace 3427
white 3264
Navy 3184
Cool Maroon 3154
ensign blue 2925
Coconut White 2770
Coconut White Lace 2663
Rosy Mauve 2639
midnight tropical 2635
matisse blue 2509
Blackberry 2447
red tropical print 2380
Bare 2356
Chai 2254
champagne 2234
nude 2181
Fir 2138
New Taupe Lace Mix 2054
Cool Maroon Lace Mix 2044
Night 2041
Black Crossdye 1982
Angel Pink Crossdye 1982
Sheer Pink Solid Lace 1971
Black Solid Lace 1941
Red 1929
Powder Blush Lace Back 1869
Cappuccino 1864
Coconut White Lace Mix 1837
purple petal 1819
Laced Arrows 1811
Black Pearl Lace Back 1788
Sterling Pewter 1785
...
TBC 2
Chevron Outline Logo/Grey Heather 2
black hearts print 2
Peach Fizz 2
Celeste 2
Strata Toasted Coconut 2
TRANQUIL BLUE 2
Porcelain Rose, Very Violet, White Jade 2
Red Dahlia 2
Alaskan Blue 2
INKED ANIMAL 2
Rosewater 2
Turkish Tile 2
Pink Wood 2
Super Pink 2
One Size 2
Honey Heather 1
French Blue 1
Tanimal All Over Lace 1
DARK MIDNIGHT 1
Neon Calypso 1
One Size Fits Most 1
UNDONE 1
Dark Purple 1
Seafoam Green 1
Unique/Bitter/White 1
Hot Pink 1
Black Coffee 1
Potent Purple 1
Lotus Gypset 1
Flash 1
WOVEN LOGO 1
SLATE 1
DUSTY LILAC 1
All Over Lace Shock Pink 1
Violet Tulle 1
Pastel Green 1
Null 1
Peacoat/Polka Dot Accent 1
ASTONISH 1
Gray Colorblock 1
French Roast/Unique/Continuous Skin 1
EMERALD 1
Ink Blue/Black 1
Gossamer Green 1
BLUE PULSE 1
Platinum Combo 1
SHIFTING PLANES LOGO-ASTONISH 1
Stippled Skin Print 1
FLASH 1
Name: color, Length: 2528, dtype: int64
categories = {
"green" : ["green", "emerald", "fir", "bayberry"],
"blue" : ["blue", "navy", "teal", "denim", "azure", "celeste", "turkish", "sea"],
"white" : ["white", "ivory", "cashew", "coconut", "marshmallow"],
"black" : ["black", "midnight", "night"],
"red": ["red", "candy apple", "ginger glaze", "plum", "maroon", "ruby","cherry", "strawberry"],
"yellow": ["yellow", "gold"],
"orange" : ["orange"],
"pink": ["pink", "rosewater", "fuschia", "blush", "peach", "lotus"],
"nude" : ["nude", "bare", "champagne"],
"gray" : ["gray", "grey", "pewter", "slate", "silver"],
"brown" : ["brown", "taupe", "chai", "cappuccino", "sienna", "toast", "french roast"],
"maroon" : ["maroon"],
"purple" : ["mauve", "lilac", "purple", "violet", "grape"]
}
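def standardize_color_group(color):
    """Return the first color group whose keyword occurs in the color name."""
    color_category = color
    try:
        color_category = color_category.lower()
        for group, items in categories.items():
            for item in items:
                if item in color_category:
                    return group
    except AttributeError:
        # Non-string values (for example NaN) have no lower(); report them.
        print(color_category)
    return color_category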
merged_df['color'] = merged_df['color'].fillna('unknown')
merged_df["color_group"] = merged_df['color'].apply(standardize_color_group)
color_group = merged_df['color_group'].value_counts()
display(color_group)
print("There are %d categories! "% (color_group.shape[0]))
color_group.plot.pie(figsize=(10,10))
color_group.to_csv('color_mania.csv', header=True)
black 115227
white 69733
blue 52789
pink 51950
red 42730
nude 30987
purple 27603
green 22306
gray 19548
brown 15340
ensign 7819
buff 5802
hello lovely 5159
radiating aztec 4861
yellow 3872
lip smacker lace 1890
laced arrows 1811
trilobel marl 1780
neon sunset logo print 1774
cocoon 1721
ensign lace mix 1699
dazzle 1440
fair orchid 1389
moonray 1381
ensign with solid lace 1345
ensign with wildflower lace back 1225
spring rain 1156
ensign crochet lace 1084
sand 1076
granite 1023
neon nectar 1022
cocoon palm print 998
spring rain with mini logo 985
forever young 945
fair orchid crochet lace 930
ensign allover jacquard 922
basil tropical print 891
neon princess 884
dark heather 871
english rose lace 867
winter rose with solid lace 862
warm spraypaint triangles 861
northstar 860
soft muslin 854
large 852
small 843
medium 834
current coral 826
chintz rose allover jacquard 805
rose luster 804
...
rose beige/sweet cream 8
cactus 8
cool blocked stripe 7
rose beige/truffle 7
warm sharp angles 7
birds of a feather print 7
cool linear waves 7
delicate skin 7
holly 6
cool gradient 6
shell/buff 6
ensign foil lace with embellishment 6
cranberry 6
dots 6
neon lemon 6
something wild 6
warm gradient 5
plus-ladylike classic 5
curved stripe print 5
niagara falls 5
deep berry 5
fair orichid 5
ladylike classic 5
regular-laguna beach 5
ginger paisley 4
cool sharp angles 4
soft begonia floral 3
light aglow 3
warm linear waves 3
luminous fuchsia 3
etched skin 3
star ferry 3
gradiant logo sultry 3
colorblock multi 3
one size 2
flash 2
tbc 2
berry sweet 2
high society 2
honey heather 1
peacoat/polka dot accent 1
shifting planes logo-astonish 1
platinum combo 1
neon calypso 1
one size fits most 1
stippled skin print 1
tanimal all over lace 1
astonish 1
null 1
woven logo 1
Name: color_group, Length: 834, dtype: int64
There are 834 categories!
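The 834 groups in the output above arise because any color name that matches no keyword is returned unchanged. The following is a minimal sketch of one way to collapse those leftovers into a single bucket. The `"other"` label, the `group_color` helper, and the small keyword table are illustrative assumptions, not part of the original notebook.

```python
import re
import pandas as pd

# Toy keyword table for illustration only; the notebook's `categories` dict is larger.
sample_categories = {
    "green": ["green", "emerald"],
    "blue": ["blue", "navy"],
    "black": ["black", "midnight"],
}

def group_color(color, categories=sample_categories, fallback="other"):
    """Return the first group whose keyword appears as a whole word in `color`."""
    if not isinstance(color, str):
        # Missing or non-string values go straight to the fallback bucket.
        return fallback
    name = color.lower()
    for group, keywords in categories.items():
        for keyword in keywords:
            # \b restricts the match to whole words rather than arbitrary substrings.
            if re.search(r"\b" + re.escape(keyword) + r"\b", name):
                return group
    return fallback

sample = pd.Series(["Navy Blue", "DARK MIDNIGHT", "Ensign", None, "Seafoam Green"])
grouped = sample.apply(group_color)
print(grouped.value_counts())
```

Whole-word matching is a deliberate change from the notebook's substring test: it avoids accidental hits inside longer words, but a name such as "Toasted" would no longer match the keyword "toast". Whether that trade-off is desirable depends on the keyword list.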
merged_df.groupby(["brand_name"])["color_group"].nunique().plot.bar(title='#Colors supported by brand')
color_count_df = merged_df.groupby(["brand_name", "product_group"]).color_group.nunique()
color_count_df = color_count_df.reset_index()
plot = sns.boxplot(x='color_group', y='brand_name', data=color_count_df)
plot.set_title("#Colors per product group vs brand")
plot.set_xlabel('#colors per product type')
merged_df.groupby('brand_name').product_name.agg(['count', 'nunique'])
| brand_name | count | nunique |
|---|---|---|
| aerie | 28256 | 106 |
| b.tempt'd | 45130 | 395 |
| calvin klein | 31126 | 442 |
| hanky panky | 48184 | 896 |
| nordstrom lingerie | 864 | 23 |
| us topshop | 3041 | 296 |
| vanity fair | 2568 | 46 |
| victoria's secret | 342542 | 429 |
| victoria's secret pink | 110546 | 172 |
plot = sns.boxplot(x='product_group', y='color_group', data=color_count_df)
plot.set_title("#Colors supported across brands vs product group")
plot.set_ylabel('#colors for product group')
・Every brand offers a wide variety of colors.
merged_df['rating'].value_counts()
4.4 37103
4.3 29563
4.5 25065
5.0 20133
4.2 17798
4.0 17781
4.6 13444
4.7 8820
3.7 8673
3.9 8033
4.1 7790
3.8 7504
4.8 6741
3.6 5060
3.2 4300
4.9 3380
3.3 2827
3.5 2471
3.0 2133
1.0 1184
2.0 509
3.1 503
3.4 459
2.9 448
2.8 360
2.6 339
2.5 244
2.7 161
1.5 160
0.0 124
2.4 52
2.3 34
2.2 8
2.1 8
1.7 8
Name: rating, dtype: int64
・Ratings are on a scale of 1.0 to 5.0, although the counts above also include 124 rows with a rating of 0.0.
pyplt.figure(figsize=(15,8))
plt.rcParams["font.size"] = 15
plot = sns.boxplot(x='brand_name', y='rating', data=merged_df)
plot.set_title("Rating range for each brand")
plot.set_xlabel('Brand')
plot.set_ylabel('Rating')
merged_df.columns
Index(['product_name', 'mrp', 'price', 'pdp_url', 'brand_name',
'product_category', 'retailer', 'description', 'rating', 'review_count',
'style_attributes', 'total_sizes', 'available_size', 'color',
'product_group', 'color_group'],
dtype='object')
・Nordstrom's ratings are poor.
・Victoria's Secret is rated lower than Hanky Panky, although its much larger inventory may be a factor.
・Hanky Panky is rated highly.
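Boxplots do not show how many rows sit behind each brand, so the inventory caveat above is hard to judge from the plot alone. One way to check it is to tabulate the sample size next to the summary statistics. The sketch below uses made-up ratings and placeholder brand labels (`x`, `y`); only the column names `brand_name` and `rating` come from the notebook.

```python
import pandas as pd

# Made-up ratings for illustration; not taken from the dataset.
toy = pd.DataFrame({
    "brand_name": ["x", "x", "x", "y"],
    "rating": [4.0, 4.5, None, 5.0],
})

# `count` ignores missing ratings, so it reflects rated rows only.
summary = toy.groupby("brand_name")["rating"].agg(["count", "median", "mean"])
print(summary)
```

Applied to `merged_df`, the same `agg` call would put each brand's median rating alongside the number of rated rows it is based on.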
product_group = merged_df['product_group'].value_counts()
# One boxplot per product group, comparing rating distributions across brands.
for category, _count in product_group.items():
    product_group_df = merged_df[merged_df["product_group"] == category]
    pyplt.figure()
    title = "Ratings range across brands for " + category
    sns.boxplot(x=product_group_df["rating"], y=product_group_df["brand_name"], palette="Pastel2").set_title(title)
・Hanky Panky is rated highly for panties and bras.
・Victoria's Secret is rated worse than every brand except Nordstrom.
merged_df['discount'] = merged_df['mrp'] - merged_df['price']
merged_df['discount%'] = merged_df['discount'] *100 / merged_df['mrp']
pyplt.figure(figsize=(15,8))
plot = sns.boxplot(x='brand_name', y='discount%', data=merged_df)
plot.set_title("Discount % for each brand")
plot.set_xlabel('Brand')
plot.set_ylabel('Discount %')
・Aerie, Calvin Klein, and b.tempt'd discount routinely.
・A small number of products can be bought at very steep discounts.
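The discount percentage used here is `(mrp - price) * 100 / mrp`. The sketch below works through that formula on made-up prices and also guards against a zero MRP, which would otherwise produce an undefined or infinite percentage. The guard is an added assumption; the notebook divides by `mrp` directly.

```python
import numpy as np
import pandas as pd

# Made-up prices for illustration; not taken from the dataset.
toy = pd.DataFrame({
    "brand_name": ["a", "a", "b", "b"],
    "mrp": [40.0, 20.0, 50.0, 0.0],
    "price": [30.0, 20.0, 25.0, 0.0],
})

toy["discount"] = toy["mrp"] - toy["price"]
# Treat a zero MRP as missing so the percentage becomes NaN for that row.
toy["discount%"] = toy["discount"] * 100 / toy["mrp"].replace(0, np.nan)

# Median discount per brand; NaN rows are skipped by default.
median_discount = toy.groupby("brand_name")["discount%"].median()
print(median_discount)
```

The median is less sensitive than the mean to the few very large discounts noted above, which is why it is used in this sketch.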
# One boxplot per product group, comparing discount percentages across brands.
for category, _count in product_group.items():
    product_group_df = merged_df[merged_df["product_group"] == category]
    pyplt.figure()
    title = "Discount% range across brands for " + category
    sns.boxplot(x=product_group_df["discount%"], y=product_group_df["brand_name"], palette="Pastel2").set_title(title)
・Most brands offer deep discounts.
・Calvin Klein discounts panties heavily, but bras much less so.
・Hanky Panky also offers some discounts.
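The per-category comparisons above can also be read from a single table instead of one plot per product group. A minimal sketch, assuming the `brand_name`, `product_group`, and `discount%` columns from the notebook; the values and the labels `bra` and `panty` below are made up for illustration.

```python
import pandas as pd

# Made-up discounts for illustration; not taken from the dataset.
toy = pd.DataFrame({
    "brand_name": ["a", "a", "a", "b"],
    "product_group": ["bra", "bra", "panty", "bra"],
    "discount%": [10.0, 30.0, 50.0, 0.0],
})

# Rows are brands, columns are product groups, cells are median discount %.
table = toy.pivot_table(
    index="brand_name",
    columns="product_group",
    values="discount%",
    aggfunc="median",
)
print(table)
```

A brand that does not sell a given product group shows up as NaN in that column, which makes gaps in the assortment visible alongside the discounts.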